NSF PAR Search | NSF Public Access Repository

Graph-based self-supervised learning for repeat detection in metagenomic assembly

https://doi.org/10.1101/gr.279136.124

Azizpour, Ali; Balaji, Advait; Treangen, Todd J; Segarra, Santiago (July 2024, Genome research)

Repetitive DNA (repeats) poses significant challenges for accurate and efficient genome assembly and sequence alignment. This is particularly true for metagenomic data, where genome dynamics such as horizontal gene transfer, gene duplication, and gene loss/gain complicate accurate genome assembly from metagenomic communities. Detecting repeats is a crucial first step in overcoming these challenges. To address this issue, we propose GraSSRep, a novel approach that leverages the assembly graph's structure through graph neural networks (GNNs) within a self-supervised learning framework to classify DNA sequences into repetitive and non-repetitive categories. Specifically, we frame this problem as a node classification task within a metagenomic assembly graph. In a self-supervised fashion, we rely on a high-precision (but low-recall) heuristic to generate pseudo-labels for a small proportion of the nodes. We then use those pseudo-labels to train a GNN embedding and a random forest classifier to propagate the labels to the remaining nodes. In this way, GraSSRep combines sequencing features with predefined and learned graph features to achieve state-of-the-art performance in repeat detection. We evaluate our method using simulated and synthetic metagenomic datasets. The results on the simulated data highlight GraSSRep's robustness to repeat attributes, demonstrating its effectiveness in handling the complexity of repeated sequences. Additionally, our experiments with synthetic metagenomic datasets reveal that incorporating the graph structure and the GNN enhances detection performance. Finally, in comparative analyses, GraSSRep outperforms existing repeat detection tools with respect to precision and recall.
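The pipeline described in this abstract (pseudo-label a few nodes with a strict heuristic, build graph-aware features, then propagate labels to the rest) can be illustrated with a toy sketch. This is not GraSSRep's code. The coverage and degree thresholds are hypothetical, one round of neighbor averaging stands in for the GNN embedding, and a nearest-centroid rule stands in for the random forest classifier.

```python
from math import dist
from statistics import mean


def pseudo_labels(graph, coverage, hi_cov=3.0, lo_cov=1.2):
    """Hypothetical high-precision, low-recall heuristic (thresholds are assumptions).

    Only clear-cut nodes get a label: 1 = repeat, 0 = non-repeat.
    """
    labels = {}
    for node, nbrs in graph.items():
        if coverage[node] >= hi_cov and len(nbrs) >= 4:
            labels[node] = 1
        elif coverage[node] <= lo_cov and len(nbrs) <= 1:
            labels[node] = 0
    return labels


def node_features(graph, coverage):
    """Per-node vector: own coverage, degree, mean neighbor coverage.

    The neighbor mean is a stand-in for a learned GNN embedding.
    """
    feats = {}
    for node, nbrs in graph.items():
        nbr_cov = mean([coverage[m] for m in nbrs]) if nbrs else 0.0
        feats[node] = (coverage[node], float(len(nbrs)), nbr_cov)
    return feats


def propagate(features, seeds):
    """Label unlabeled nodes by nearest class centroid (stand-in for a random forest)."""
    centroids = {}
    for cls in set(seeds.values()):
        members = [features[n] for n, c in seeds.items() if c == cls]
        centroids[cls] = tuple(mean(col) for col in zip(*members))
    labels = dict(seeds)
    for node, vec in features.items():
        if node not in labels:
            labels[node] = min(centroids, key=lambda c: dist(vec, centroids[c]))
    return labels


# Toy assembly graph: "r" is a high-coverage hub, "s" a milder one.
graph = {
    "a": ["r"], "b": ["r"], "c": ["r", "s"], "d": ["r", "s"],
    "r": ["a", "b", "c", "d"], "s": ["c", "d", "e"], "e": ["s"],
}
coverage = {"a": 1.0, "b": 1.1, "c": 0.9, "d": 1.0, "r": 4.2, "s": 2.4, "e": 1.0}

seeds = pseudo_labels(graph, coverage)
labels = propagate(node_features(graph, coverage), seeds)
print(seeds)
print(labels)
```

On this toy graph the heuristic leaves "c", "d", and "s" unlabeled, and the propagation step assigns them using features that include their neighborhood. A real implementation would replace both stand-ins with trained models.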
Full Text Available
Multiple genome alignment in the telomere-to-telomere assembly era

https://doi.org/10.1186/s13059-022-02735-6

Kille, Bryce; Balaji, Advait; Sedlazeck, Fritz J.; Nute, Michael; Treangen, Todd J. (December 2022, Genome Biology)

With the arrival of telomere-to-telomere (T2T) assemblies of the human genome comes the computational challenge of efficiently and accurately constructing multiple genome alignments at an unprecedented scale. By identifying nucleotides across genomes which share a common ancestor, multiple genome alignments commonly serve as the bedrock for comparative genomics studies. In this review, we provide an overview of the algorithmic template that most multiple genome alignment methods follow. We also discuss prospective areas of improvement of multiple genome alignment for keeping up with continuously arriving high-quality T2T assembled genomes and for unlocking clinically relevant insights.
Full Text Available
SeqScreen: accurate and sensitive functional screening of pathogenic sequences via ensemble learning

https://doi.org/10.1186/s13059-022-02695-x

Balaji, Advait; Kille, Bryce; Kappell, Anthony D.; Godbold, Gene D.; Diep, Madeline; Elworth, R. A.; Qian, Zhiqin; Albin, Dreycey; Nasko, Daniel J.; Shah, Nidhi; et al (December 2022, Genome Biology)

The COVID-19 pandemic has emphasized the importance of accurate detection of known and emerging pathogens. However, robust characterization of pathogenic sequences remains an open challenge. To address this need we developed SeqScreen, which accurately characterizes short nucleotide sequences using taxonomic and functional labels and a customized set of curated Functions of Sequences of Concern (FunSoCs) specific to microbial pathogenesis. We show our ensemble machine learning model can label protein-coding sequences with FunSoCs with high recall and precision. SeqScreen is a step towards a novel paradigm of functionally informed synthetic DNA screening and pathogen characterization, available for download at www.gitlab.com/treangenlab/seqscreen.
Full Text Available
KOMB: K-core based de novo characterization of copy number variation in microbiomes

https://doi.org/10.1016/j.csbj.2022.06.019

Balaji, Advait; Sapoval, Nicolae; Seto, Charlie; Leo Elworth, R.A.; Fu, Yilei; Nute, Michael G.; Savidge, Tor; Segarra, Santiago; Treangen, Todd J. (January 2022, Computational and Structural Biotechnology Journal)

Full Text Available
Current progress and open challenges for applying deep learning across the biosciences

https://doi.org/10.1038/s41467-022-29268-7

Sapoval, Nicolae; Aghazadeh, Amirali; Nute, Michael G.; Antunes, Dinler A.; Balaji, Advait; Baraniuk, Richard; Barberan, C. J.; Dannenfelser, Ruth; Dun, Chen; Edrisi, Mohammadamin; et al (December 2022, Nature Communications)

Deep Learning (DL) has recently enabled unprecedented advances in one of the grand challenges in computational biology: the half-century-old problem of protein structure prediction. In this paper we discuss recent advances, limitations, and future perspectives of DL on five broad areas: protein structure prediction, protein function prediction, genome engineering, systems biology and data integration, and phylogenetic inference. We discuss each application area and cover the main bottlenecks of DL approaches, such as training data, problem scope, and the ability to leverage existing DL architectures in new contexts. To conclude, we provide a summary of the subject-specific and general challenges for DL across the biosciences.
Full Text Available
To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics

https://doi.org/10.1093/nar/gkaa265

Elworth, R A; Wang, Qi; Kota, Pavan K; Barberan, C J; Coleman, Benjamin; Balaji, Advait; Gupta, Gaurav; Baraniuk, Richard G; Shrivastava, Anshumali; Treangen, Todd J (April 2020, Nucleic Acids Research)
As computational biologists continue to be inundated by ever-increasing amounts of metagenomic data, the need for data analysis approaches that keep up with the pace of sequence archives has remained a challenge. In recent years, the accelerated pace of genomic data availability has been accompanied by the application of a wide array of highly efficient approaches from other fields to the field of metagenomics. For instance, sketching algorithms such as MinHash have seen a rapid and widespread adoption. These techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculations. Here, we briefly review the fundamentals of the most impactful probabilistic and signal processing algorithms. We also highlight more recent advances to augment previous reviews in these areas that have taken a broader approach. We then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions.
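To make the MinHash idea mentioned in this abstract concrete, the following is a minimal sketch of estimating k-mer set similarity between two sequences from fixed-size signatures. It is an illustration only, not code from the reviewed tools. The function names, the k-mer size, the number of hash functions, and the use of salted BLAKE2b as the hash family are all assumptions chosen for brevity.

```python
import hashlib


def kmers(seq, k=4):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(items, num_hashes=64):
    """For each of num_hashes salted hash functions, keep the minimum hash value.

    Salted BLAKE2b is an assumed hash family; production tools use faster hashes.
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(x.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for x in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


seq_a = "ACGTACGTTGCAACGT"
seq_b = "ACGTACGTTGCCACGT"
sig_a = minhash_signature(kmers(seq_a))
sig_b = minhash_signature(kmers(seq_b))
similarity = estimate_jaccard(sig_a, sig_b)
print(f"estimated Jaccard similarity: {similarity:.2f}")
```

The signatures have a fixed length regardless of sequence size, which is what lets sketches of this kind compare very large datasets cheaply. More hash functions reduce the variance of the estimate at the cost of a larger signature.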
Full Text Available
